NASA Contractor Report 172340 




UPPER AND LOWER BOUNDS FOR SEMI-MARKOV
RELIABILITY MODELS OF RECONFIGURABLE SYSTEMS




Allan L. White 







KENTRON INTERNATIONAL, INC. 
Kentron Technical Center 
Hampton, Virginia 23666 


Contract NAS1-16000
April 1984 





NASA

National Aeronautics and 
Space Administration 

Langley Research Center 

Hampton, Virginia 23665 



ABSTRACT 


This paper determines the information required about system recovery to 
compute the reliability of a class of reconfigurable systems. Upper and lower 
reliability bounds are derived for these systems. The class consists of those 
systems satisfying five assumptions: the individual components fail
independently at a low constant rate, fault occurrence and system
reconfiguration are independent processes, the reliability model is semi-Markov, 
the recovery functions which describe system reconfiguration have small means 
and variances, and the system is well designed. The derivation proceeds by 
considering paths through the reliability model from the initial, fault-free, 
state to the absorbing, system-failure, states. Since the probability of system 
failure is the sum of the probabilities of traversing these fatal paths, it 
suffices to obtain bounds on traversing a path by a given time. The bounds 
involve the component failure rates and the means and variances of the recovery 
functions. They are easy to compute, and illustrative examples are included. 






1. INTRODUCTION 


This paper determines the information needed about fault recovery to 
compute the reliability of a class of reconfigurable systems. 

Reconfigurable systems can identify a faulty component, remove it from the
working group, and replace it with a spare if available. Typically, building
such a system is only justified if the reliability requirement is high, often high
enough that natural life testing is impossible and system reliability must be
computed from a mathematical model that includes descriptions of component
failure and system recovery. Hence the modeling problem consists of a complex
system whose reliability requires careful computation. This combination
suggests delicate experiments with hard statistical analyses to get a
description of system fault recovery, followed by difficult calculations to get
an estimate of system reliability. Even more important, it may not be clear
what needs to be observed in the experiments and included in the calculations.

Given certain assumptions about component and system behavior, this paper 
derives upper and lower bounds for the probability of system failure in terms of 
system operating time, component fault rates, and the means and variances of 
system fault recovery times. The assumptions used are common (see references
[1], [2], and [3]), and their plausibility is discussed below. However, their
plausibility and common use do not mean the assumptions are valid, and more 
investigation is required before the derived bounds can be confidently applied 
to a reconfigurable system. 

The derivation of the bounds requires five assumptions: 1) components fail
independently at a low constant rate; 2) component failure and system recovery
are independent processes; 3) the system quickly recovers from all faults; 4)
fault recovery depends only on time elapsed since fault occurrence; 5) the
system is well designed. The first assumption is appropriate for high quality
components operating for a short period of time in a benign environment, but may
not be applicable otherwise. The second assumption is reasonable if failure is
an instantaneous event, so that a component's imminent failure does not affect its
current performance. The third assumption on quick recovery describes a
desirable property for reconfigurable systems since these systems fail if too
many faults accumulate in the working group of components. If recovery is quick
then the reconfiguration process has a small mean. If recovery is quick for all
faults then the reconfiguration process has a small deviation from the mean,
measured by the variance. Hence the third assumption has a mathematical
version: 3') any system fault recovery has a small mean and variance. The
fourth assumption, together with the first on constant rates, says the
reliability model is semi-Markov. The major objection against a semi-Markov
model is that fault recovery may depend on what the system is doing at the time
of fault occurrence. A later section considers time dependent recovery and
shows the same upper bound is still valid. Because the mathematics is more
complicated for the time dependent case, no attempt is made to derive a lower
bound. The fifth assumption about the system being well designed means the
system only fails when overwhelmed by faulty components. It excludes the
conceivable case of a system that fails to operate properly even though all
the components are fault free.

The next section presents an arbitrary path from the initial state to a
failure state in a semi-Markov reliability model and derives upper and lower
bounds for traversing the path by a given time. The probability of system
failure is the probability of traversing any one of these fatal paths, which
means an upper bound for system failure is the sum of the upper bounds for all
the paths, while a lower bound for system failure is the sum of the lower bounds
for all the paths. Simple addition of the probabilities suffices because
traversing one path and traversing another path are disjoint events. The bounds
established in the next section are partly numerical and partly algebraic. The
numerical part consists of solving the simultaneous linear differential
equations associated with a constant rate Markov model where all the rates are
fairly close, which is an easy exercise for a computer numerical package. The
algebraic part consists of expressions involving component fault rates and the
means and variances of system recovery times. Section four derives purely
algebraic bounds and discusses their accuracy. The algebraic upper bound is
particularly easy to use, and it shows the influence of fault rates and recovery
times on system reliability. Each of these sections is followed by a section
containing an example. Section six shows that the same upper bound is still
valid even if system fault recovery is time dependent.


Besides determining the information required about system recovery, the
material below offers some other benefits. The upper and lower bounds are
derived rigorously from the assumptions, placing the resulting calculations on
firm foundations. The bounds are proved for arbitrary recovery distributions
with finite means and variances, which eliminates concern over the applicability
of a parametric model. The fault injection experiments to study system
recovery need only record the time between fault injection and system recovery,
with no information required about the intermediate steps. Since different
system architectures produce reliability models with different paths to failure,
the calculations based on paths to failure reflect the influence of
architecture on reliability. (For examples, see references [6] and [7].) The
bounds are easy to compute, and they use familiar mathematics and statistics:
differential equations, means, and variances. The algebraic upper bound used as
an approximation formula allows computation from a mere inspection of the
reliability model and reveals the influence of the various parameters on system
reliability. The major disadvantage of the approach below is that it may not be
able to handle transient and intermittent faults.

Besides the references mentioned before, references [4] and [5] contain the 
necessary probability theory, while [8] and [9] present other approaches to the
reliability of reconfigurable systems. 



2. THE APPROXIMATION THEOREM


Upper and lower reliability bounds are obtained by considering the paths in 
the reliability model that begin at the initial state and proceed to an 
absorbing state representing system failure. A general path, rearranged for 
notational convenience, is displayed in figure 1. Any transition on a path is 
by means of a fault occurrence competing with other fault occurrences, or by 
means of system recovery competing with fault occurrences, or by means of a 
fault occurrence competing with system recovery and other fault occurrences. In 
figure 1, the fault occurrence transitions are labeled by the component failure 
rates, and the system recovery transitions are labeled by the generalized 
densities of the recovery distributions. Figure 2 shows the first part of the 
path, consisting of just fault occurrence transitions with the absorbing state E
replacing the non-absorbing state $B_1$. Because E is the absorbing state of a
constant rate Markov process, the probability of being in state E by a given
time is easy to compute.
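
The computation of E(T) for the constant rate part of a path can be sketched as follows. This is a minimal illustration, not the numerical package the report has in mind: the function name `path_probability`, its arguments, and the choice of a fixed-step fourth-order Runge-Kutta integrator are all assumptions made here. Each transient state is left at the sum of its on-path and off-path rates, and only the on-path rate feeds the next state.

```python
def path_probability(on_path, off_path, t_end, steps=2000):
    """Probability of reaching the absorbing state E by time t_end.

    on_path[i]  : rate of the transition out of state i that stays on the path
    off_path[i] : total rate of the competing transitions that leave the path
    (Hypothetical helper; integrates the chain's linear ODEs with classical RK4.)
    """
    k = len(on_path)
    exit_rate = [a + b for a, b in zip(on_path, off_path)]

    def deriv(p):
        # p[0..k-1] are the transient states, p[k] is the absorbing state E.
        d = [0.0] * (k + 1)
        for i in range(k):
            d[i] -= exit_rate[i] * p[i]      # all departures from state i
            d[i + 1] += on_path[i] * p[i]    # only on-path departures arrive
        return d

    p = [1.0] + [0.0] * k                    # start in the fault-free state
    h = t_end / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([x + 0.5 * h * d for x, d in zip(p, k1)])
        k3 = deriv([x + 0.5 * h * d for x, d in zip(p, k2)])
        k4 = deriv([x + h * d for x, d in zip(p, k3)])
        p = [x + h * (a + 2 * b + 2 * c + d) / 6
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p[k]
```

For a single transition with no competitors, `path_probability([r], [0.0], t)` should agree with `1 - exp(-r * t)`, which gives a simple check of the integrator.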

In the first third of figure 1, the $\lambda$'s are the rates of component failures
that stay on the path, while the $\gamma$'s are those that lead off the path. In the
second third, the dF's are the generalized densities of recovery transitions
that stay on the path, while the $\epsilon$'s are the rates of component failures that
lead off the path. In the final third, the $\alpha$'s are the rates of component
failures that stay on the path, while the dG's and $\beta$'s represent recovery
transitions and component failures that lead off the path.


Figure 1: A Path in a Semi-Markov Reliability Model

Figure 2: The Constant Rate Markov Part of the Path


Let D(T) and E(T) be the probabilities of being in states D and E by time
T. Suppose the distribution $F_i$ has mean $\mu_i$ and variance $\sigma_i^2$, and $G_j$ has
mean $\eta_j$ and variance $\tau_j^2$. Let

$$\Delta = \mu_1^{1/2} + \cdots + \mu_m^{1/2} + \eta_1^{1/2} + \cdots + \eta_n^{1/2}$$

and assume $\Delta < T$.

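Theorem With the notation as above,

$$E(T) \prod_{j=1}^{n} \alpha_j \eta_j \;\ge\; D(T) \;\ge\; E(T-\Delta) \prod_{i=1}^{m} \left[ 1 - \epsilon_i \mu_i - \frac{\sigma_i^2 + \mu_i^2}{\mu_i} \right] \prod_{j=1}^{n} \alpha_j \left[ \eta_j - \frac{(\alpha_j + \beta_j)(\tau_j^2 + \eta_j^2)}{2} - \frac{\tau_j^2 + \eta_j^2}{\eta_j^{1/2}} \right]$$
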
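
Proposition Suppose H is a distribution function, H(x) = 0 for x < 0, and H has
finite mean $\mu$ and variance $\sigma^2$. Then, for $\epsilon, \alpha, \beta > 0$,

(i) $\int_0^\infty \alpha e^{-(\alpha+\beta)x} [1 - H(x)]\,dx \le \alpha\mu$

(ii) $\int_0^\infty e^{-\epsilon x}\,dH(x) \le 1$

(iii) $\int_0^{\mu^{1/2}} e^{-\epsilon x}\,dH(x) \ge 1 - \epsilon\mu - \frac{\sigma^2 + \mu^2}{\mu}$

(iv) $\int_0^{\mu^{1/2}} \alpha e^{-(\alpha+\beta)x} [1 - H(x)]\,dx \ge \alpha \left[ \mu - \frac{(\alpha+\beta)(\sigma^2 + \mu^2)}{2} - \frac{\sigma^2 + \mu^2}{\mu^{1/2}} \right]$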
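
As a quick sanity check of parts (i) and (ii), one can take an exponential recovery distribution with mean $\mu$. The exponential form is an assumption made here purely for illustration; the proposition itself requires only a finite mean and variance. Both integrals then have closed forms that fall below the stated bounds.

```latex
% Illustrative check, assuming H(x) = 1 - e^{-x/\mu} for x >= 0.
\begin{align*}
\int_0^\infty \alpha e^{-(\alpha+\beta)x}\,[1 - H(x)]\,dx
  &= \int_0^\infty \alpha e^{-(\alpha+\beta+1/\mu)x}\,dx
   = \frac{\alpha\mu}{1 + (\alpha+\beta)\mu} \;\le\; \alpha\mu, \\
\int_0^\infty e^{-\epsilon x}\,dH(x)
  &= \int_0^\infty \frac{1}{\mu}\, e^{-(\epsilon+1/\mu)x}\,dx
   = \frac{1}{1 + \epsilon\mu} \;\le\; 1 .
\end{align*}
```

In both cases the gap to the bound is of relative order (rate) x (mean recovery time), which is small under the report's assumptions of low fault rates and quick recovery.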
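
Proof of the Proposition

The derivation uses the standard results

$$1 - x \le e^{-x} \le 1 \quad \text{for } x \ge 0$$

$$\int_0^\infty [1 - H(x)]\,dx = \mu$$

$$\int_0^\infty x\,[1 - H(x)]\,dx = \frac{\sigma^2 + \mu^2}{2}$$

$$1 - H(c) = \int_c^\infty dH(x) \le \frac{\sigma^2 + \mu^2}{c^2} \quad \text{for } c > 0.$$

The proof of (iii) is

$$\int_0^{\mu^{1/2}} e^{-\epsilon x}\,dH(x) \ge \int_0^{\mu^{1/2}} (1 - \epsilon x)\,dH(x) \ge \int_0^{\mu^{1/2}} dH(x) - \epsilon \int_0^\infty x\,dH(x) = 1 - [1 - H(\mu^{1/2})] - \epsilon\mu \ge 1 - \epsilon\mu - \frac{\sigma^2 + \mu^2}{\mu}.$$

The proof of (iv) is

$$\int_0^{\mu^{1/2}} \alpha e^{-(\alpha+\beta)x} [1 - H(x)]\,dx \ge \alpha \int_0^{\mu^{1/2}} [1 - (\alpha+\beta)x]\,[1 - H(x)]\,dx$$

$$\ge \alpha \left\{ \int_0^\infty [1 - H(x)]\,dx - \int_{\mu^{1/2}}^\infty [1 - H(x)]\,dx - (\alpha+\beta) \int_0^\infty x\,[1 - H(x)]\,dx \right\}$$

$$\ge \alpha \left[ \mu - \frac{\sigma^2 + \mu^2}{\mu^{1/2}} - \frac{(\alpha+\beta)(\sigma^2 + \mu^2)}{2} \right],$$

where the middle term uses $1 - H(x) \le (\sigma^2 + \mu^2)/x^2$, so that
$\int_c^\infty [1 - H(x)]\,dx \le (\sigma^2 + \mu^2)/c$ with $c = \mu^{1/2}$.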

Proof of the Theorem

Let q(t) be the density function of E(t). Since the path in figure 1 is
from a semi-Markov process,

$$D(T) = \int_0^T \int_0^{T-t} \cdots \int_0^{T-t-x_1-\cdots-x_m-y_1-\cdots-y_{n-1}} q(t) \prod_{i=1}^{m} e^{-\epsilon_i x_i} \prod_{j=1}^{n} \alpha_j e^{-(\alpha_j+\beta_j) y_j} [1 - G_j(y_j)] \; dy_n \cdots dy_1 \, dF_m(x_m) \cdots dF_1(x_1) \, dt.$$

Working with just the limits of integration, and keeping the same integrand,

$$D(T) \le \int_0^T \int_0^\infty \cdots \int_0^\infty \int_0^\infty \cdots \int_0^\infty$$

and

$$D(T) \ge \int_0^{T-\Delta} \int_0^{\mu_1^{1/2}} \cdots \int_0^{\mu_m^{1/2}} \int_0^{\eta_1^{1/2}} \cdots \int_0^{\eta_n^{1/2}}.$$

To complete the proof write the multiple integrals as iterated integrals, and
apply the inequalities in the proposition to the integrands.



3. EXAMPLE


One of the simplest reconfigurable systems consists of a working triad plus 
a spare. The majority voting lets the triad detect a faulty member and maintain 
process control while replacing it with the spare. Figure 3 displays the first 
two failure states of the system. The mnemonics are I for the initial state, Q 
for a faulty component in the triad, R for system recovery, and D for system 
failure because of two faulty components in the triad. The transitions are 
labeled with either component failure rates or generalized densities of recovery 
functions. The vertical transitions refer to failure of the spare. 

There is one path to state $D_1$ and one path to state $D_2$. The constant rate
Markov parts of these paths are given in figure 4 with $E_1$ and $E_2$ as the absorbing
states.


Figure 4: (a) The Constant Rate Part of the First Path
(b) The Constant Rate Part of the Second Path

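
Let $H_i$ have mean $\mu_i$ and variance $\sigma_i^2$. The inequalities are

$$E_1(T)\,\{2\lambda\mu_1\} \;\ge\; D_1(T) \;\ge\; E_1(T - \mu_1^{1/2}) \left\{ 2\lambda \left[ \mu_1 - \frac{3\lambda(\sigma_1^2 + \mu_1^2)}{2} - \frac{\sigma_1^2 + \mu_1^2}{\mu_1^{1/2}} \right] \right\}$$

and

$$E_2(T)\,\{2\lambda\mu_2\} \;\ge\; D_2(T) \;\ge\; E_2(T - \mu_1^{1/2} - \mu_2^{1/2}) \times \left\{ 2\lambda \left[ 1 - 3\lambda\mu_1 - \frac{\sigma_1^2 + \mu_1^2}{\mu_1} \right] \left[ \mu_2 - \lambda(\sigma_2^2 + \mu_2^2) - \frac{\sigma_2^2 + \mu_2^2}{\mu_2^{1/2}} \right] \right\}.$$
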

For a numerical comparison suppose $H_1$ represents a fixed time recovery that
takes one second, and suppose $H_2$ is the uniform distribution from zero to one
second. In terms of hours the means and variances are

$$\mu_1 = 2.78 \times 10^{-4}, \qquad \sigma_1^2 = 0$$

$$\mu_2 = 1.39 \times 10^{-4}, \qquad \sigma_2^2 = 6.43 \times 10^{-9}$$
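
The conversion from seconds to hours behind these figures is worth writing out, since the variance picks up the conversion factor squared. The lines below are a worked check using only the standard mean and variance of a point mass and of a uniform distribution.

```latex
% One-second fixed recovery and uniform(0, 1 s) recovery, expressed in hours.
\begin{align*}
\mu_1 &= \frac{1}{3600}\ \text{h} \approx 2.78 \times 10^{-4}\ \text{h},
 & \sigma_1^2 &= 0, \\
\mu_2 &= \frac{1/2}{3600}\ \text{h} \approx 1.39 \times 10^{-4}\ \text{h},
 & \sigma_2^2 &= \frac{1/12}{3600^2}\ \text{h}^2 \approx 6.43 \times 10^{-9}\ \text{h}^2 .
\end{align*}
```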

If the component fault rate and operating time are

$$\lambda = 5 \times 10^{-4} \text{ per hour}, \qquad T = 1 \text{ hour}$$

then the inequalities are

$$4.16 \times 10^{-10} \;\ge\; D_1(1) \;\ge\; 4.02 \times 10^{-10}$$

$$1.56 \times 10^{-13} \;\ge\; D_2(1) \;\ge\; 1.45 \times 10^{-13}.$$
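
The quoted figures can be reproduced with a short script, sketched below under stated assumptions. Figure 3 is not reproduced in this text, so the transition rates used here (3λ out of a fault-free triad, 2λ for a second triad fault, λ for failure of the spare) are inferred from the inequalities of this section rather than read from the figure. All function names are hypothetical. Because every on-path rate in the constant rate part equals 3λ, E_1 and E_2 reduce to Erlang distribution functions.

```python
import math

def erlang_cdf(rate, stages, t):
    """P(sum of `stages` independent exponential(rate) holding times <= t)."""
    x = rate * t
    partial = sum(x**i / math.factorial(i) for i in range(stages))
    return 1.0 - math.exp(-x) * partial

def recovery_lower(eps, mean, var):
    """Lower-bound factor for an on-path recovery racing faults of total rate eps."""
    return 1.0 - eps * mean - (var + mean**2) / mean

def failure_lower(alpha, beta, mean, var):
    """Lower-bound factor for an on-path fault (rate alpha) racing a recovery
    and other faults (total rate beta)."""
    second_moment = var + mean**2
    return alpha * (mean - (alpha + beta) * second_moment / 2.0
                    - second_moment / math.sqrt(mean))

def triad_bounds(lam, t, mu1, var1, mu2, var2):
    # Assumed rates, inferred from the example's inequalities (not from figure 3):
    # 3*lam leaves a fault-free triad, 2*lam is a second triad fault, lam is the spare.
    d1_upper = erlang_cdf(3 * lam, 1, t) * (2 * lam * mu1)
    d1_lower = (erlang_cdf(3 * lam, 1, t - math.sqrt(mu1))
                * failure_lower(2 * lam, lam, mu1, var1))
    d2_upper = erlang_cdf(3 * lam, 2, t) * (2 * lam * mu2)
    d2_lower = (erlang_cdf(3 * lam, 2, t - math.sqrt(mu1) - math.sqrt(mu2))
                * recovery_lower(3 * lam, mu1, var1)
                * failure_lower(2 * lam, 0.0, mu2, var2))
    return (d1_lower, d1_upper), (d2_lower, d2_upper)

if __name__ == "__main__":
    d1, d2 = triad_bounds(5e-4, 1.0, 2.78e-4, 0.0, 1.39e-4, 6.43e-9)
    print("D1(1) lies between %.3e and %.3e" % d1)
    print("D2(1) lies between %.3e and %.3e" % d2)
```

The results should be compared with the bounds quoted above, allowing for rounding in the last digit.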



4. ALGEBRAIC BOUNDS


The upper and lower bounds derived in section two become completely
algebraic when algebraic bounds are provided for E(S), the probability of
traversing the path in figure 2 by time S. Jumping ahead to the next theorem
and using the notation in figure 2, these bounds are

$$\frac{\lambda_1 \cdots \lambda_k S^k}{k!} \;\ge\; E(S) \;\ge\; \frac{\lambda_1 \cdots \lambda_k S^k}{k!} \left[ 1 - \frac{S(\lambda_1 + \gamma_1 + \cdots + \lambda_k + \gamma_k)}{k+1} \right]$$

Since

$$\frac{\text{Upper Bound} - \text{Lower Bound}}{\text{Upper Bound}} = \frac{S(\lambda_1 + \gamma_1 + \cdots + \lambda_k + \gamma_k)}{k+1}$$

it can be seen that the algebraic bounds for E(S) are accurate when the product
of the operating time and the sum of the fault rates is small.
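
The algebraic bounds for E(S) are simple enough to evaluate directly. The sketch below is illustrative only; the function name `algebraic_bounds` and its argument layout are assumptions introduced here. It returns the lower bound, the upper bound, and the relative gap between them.

```python
import math

def algebraic_bounds(on_path, off_path, s):
    """Algebraic lower and upper bounds for E(S), plus their relative gap.

    on_path  : the rates lambda_1..lambda_k that stay on the path
    off_path : the rates gamma_1..gamma_k that lead off the path
    s        : the time S by which the path must be traversed
    (Hypothetical helper following the displayed inequality.)
    """
    k = len(on_path)
    upper = math.prod(on_path) * s**k / math.factorial(k)
    rel_gap = s * (sum(on_path) + sum(off_path)) / (k + 1)
    lower = upper * (1.0 - rel_gap)
    return lower, upper, rel_gap
```

For example, `algebraic_bounds([1.5e-3, 1.5e-3], [0.0, 0.0], 1.0)` describes two successive transitions at rate 1.5e-3 per hour over one hour; the relative gap is then one part in a thousand, which shows how tight the bounds are when rate times operating time is small.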


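Theorem With the notation in figure 2,

$$\frac{\lambda_1 \cdots \lambda_k S^k}{k!} \;\ge\; E(S) \;\ge\; \frac{\lambda_1 \cdots \lambda_k S^k}{k!} \left[ 1 - \frac{S(\lambda_1 + \gamma_1 + \cdots + \lambda_k + \gamma_k)}{k+1} \right]$$
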
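
Proof

The upper bound is the easier. Since $e^{-x} \le 1$ for $x \ge 0$,

$$E(S) = \int_0^S \int_0^{S-t_1} \cdots \int_0^{S-t_1-\cdots-t_{k-1}} \prod_{i=1}^{k} \lambda_i e^{-(\lambda_i+\gamma_i)t_i} \, dt_k \cdots dt_1 \;\le\; \lambda_1 \cdots \lambda_k \int_0^S \int_0^{S-t_1} \cdots \int_0^{S-t_1-\cdots-t_{k-1}} dt_k \cdots dt_1 = \frac{\lambda_1 \cdots \lambda_k S^k}{k!}.$$

For the lower bound, begin with k = 1.

$$E(S) = \frac{\lambda_1}{\lambda_1 + \gamma_1} \left( 1 - e^{-(\lambda_1+\gamma_1)S} \right) \ge \frac{\lambda_1}{\lambda_1 + \gamma_1} \left( 1 - 1 + (\lambda_1 + \gamma_1)S - \frac{(\lambda_1 + \gamma_1)^2 S^2}{2} \right) = \lambda_1 S \left[ 1 - \frac{S(\lambda_1 + \gamma_1)}{2} \right]$$

Assume the lower bound is true for k = n. For k = n + 1, condition on the time t
of the first transition and apply the assumed bound to the remaining n
transitions, using $e^{-x} \ge 1 - x$ on the positive term and $e^{-x} \le 1$ on the
negative term:

$$E(S) \ge \int_0^S \lambda_1 e^{-(\lambda_1+\gamma_1)t} \, \frac{\lambda_2 \cdots \lambda_{n+1} (S-t)^n}{n!} \left[ 1 - \frac{(S-t)(\lambda_2 + \gamma_2 + \cdots + \lambda_{n+1} + \gamma_{n+1})}{n+1} \right] dt$$

$$\ge \frac{\lambda_1 \lambda_2 \cdots \lambda_{n+1} S^{n+1}}{(n+1)!} - \frac{\lambda_1 \lambda_2 \cdots \lambda_{n+1} (\lambda_2 + \gamma_2 + \cdots + \lambda_{n+1} + \gamma_{n+1}) S^{n+2}}{(n+1)!\,(n+2)} - \frac{\lambda_1 \lambda_2 \cdots \lambda_{n+1} (\lambda_1 + \gamma_1) S^{n+2}}{n!\,(n+1)(n+2)}$$

$$= \frac{\lambda_1 \cdots \lambda_{n+1} S^{n+1}}{(n+1)!} \left[ 1 - \frac{S(\lambda_1 + \gamma_1 + \cdots + \lambda_{n+1} + \gamma_{n+1})}{n+2} \right]$$

The theorem is proved.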



5. ALGEBRAIC EXAMPLE 


This section illustrates using the algebraic upper bound as an
approximation formula. Consider the first two failure states for a triad plus a
spare depicted in figure 3. The algebraic upper bounds are

$$D_1(T) \approx 6 \lambda^2 T \mu_1$$

$$D_2(T) \approx 9 \lambda^3 T^2 \mu_2$$

where $\lambda$ is the component fault rate, T is the operating time, and $\mu_i$ is the
mean of the ith system recovery. The first failure probability is linear in
operating time, linear in average recovery time, and quadratic in component
fault rate. The ratio

$$\frac{D_2(T)}{D_1(T)} = \frac{3 \lambda T \mu_2}{2 \mu_1}$$

says that if $\mu_2$ is approximately equal to $\mu_1$ then $D_2$ is smaller than $D_1$ by a
factor of about $\lambda T$. For common values of $\lambda$ and T, $D_2$ is several orders of
magnitude smaller than $D_1$.
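
The approximation formulas can be evaluated in a few lines. The sketch below uses the parameter values of the section three example; the function names are hypothetical and introduced only for this illustration.

```python
def d1_approx(lam, t, mu1):
    """Algebraic approximation 6 * lam^2 * T * mu_1 for the first failure state."""
    return 6.0 * lam**2 * t * mu1

def d2_approx(lam, t, mu2):
    """Algebraic approximation 9 * lam^3 * T^2 * mu_2 for the second failure state."""
    return 9.0 * lam**3 * t**2 * mu2

if __name__ == "__main__":
    lam, t = 5e-4, 1.0             # fault rate per hour, operating time in hours
    mu1, mu2 = 2.78e-4, 1.39e-4    # mean recovery times in hours
    d1 = d1_approx(lam, t, mu1)
    d2 = d2_approx(lam, t, mu2)
    print("D1 approx %.3e, D2 approx %.3e, ratio %.3e" % (d1, d2, d2 / d1))
```

With these values the ratio equals 3λTμ_2/(2μ_1), so the second failure state contributes far less than the first, in line with the discussion above.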

The technique above can be applied to a complete reliability model to 
identify the dominant failure modes and the important parameters. 



6. TIME DEPENDENT RECOVERY


This section shows that the upper bound established for semi-Markov models 
in section two is still an upper bound when system fault recovery is time 
dependent. The algebraic upper bound derived in section four also remains 
valid. All the assumptions remain the same except that fault recovery is time 
and path dependent. 

Consider state j on a path. Let $F(t_1, \ldots, t_{j-1})$ be the probability that
the holding time in state i is less than or equal to $t_i$ for $1 \le i \le j-1$.
Let $H[t_1, \ldots, t_{j-1}](t_j)$ be the distribution function for fault recovery in
state j given the holding time in state i is $t_i$ for $1 \le i \le j-1$. The item
of interest is the conditional mean

$$\mu_j = \frac{\int_0^T \left[ \int_0^\infty t_j \, dH[t_1, \ldots, t_{j-1}](t_j) \right] dF(t_1, \ldots, t_{j-1})}{\int_0^T dF(t_1, \ldots, t_{j-1})}$$

which is the average recovery time for state j given the system reaches state j
on the path being considered by time T. Note that recovery time in state j can
depend not only on the time of entry into state j, which is $t_1 + \cdots + t_{j-1}$, but
also on the intermediate states and the holding time in each of the intermediate
states.
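
As a consistency check on this definition, consider the special case in which recovery does not depend on the history. The symbols $H$ and $\mu$ without history arguments are shorthand introduced here, and the denominator is assumed to be positive. The conditional mean then collapses to the ordinary mean used in the semi-Markov sections.

```latex
% Special case: H[t_1,...,t_{j-1}](t_j) = H(t_j) for every history, with mean \mu.
\begin{align*}
\mu_j
  = \frac{\int_0^T \left[ \int_0^\infty t_j \, dH(t_j) \right] dF(t_1,\ldots,t_{j-1})}
         {\int_0^T dF(t_1,\ldots,t_{j-1})}
  = \frac{\mu \int_0^T dF(t_1,\ldots,t_{j-1})}
         {\int_0^T dF(t_1,\ldots,t_{j-1})}
  = \mu .
\end{align*}
```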

The demonstration that the same upper bound remains valid proceeds
inductively by removing expressions containing recovery distributions from the
integral giving the probability of traversing a path by time T. The expression
containing a recovery distribution is replaced by a factor of 1 if the
transition is a recovery competing with component failures. It is replaced by a
factor of $\alpha_j \mu_j$ if the transition is a component failure with rate $\alpha_j$
competing with other component failures and with a recovery that has conditional
mean $\mu_j$. The general case in the inductive step where the transition on the
path at state j is a recovery is described by the iterated integral

$$\int_0^T dF(t_1, \ldots, t_{j-1}) \int_0^{T-t_1-\cdots-t_{j-1}} e^{-\epsilon_j t_j} \, dH[t_1, \ldots, t_{j-1}](t_j) \int_0^{T-t_1-\cdots-t_j} dG(t_{j+1})$$

where F and H are as described above and G is a composition of constant failure
rate transitions competing with other constant failure rate transitions. As a
distribution representing the sum of sojourn times associated with component
failures, G is time independent. At this point in the induction, the
transitions involving a recovery that occur after state j have been
replaced by their upper bounds. Clearly the last expression is less than or
equal to

$$\int_0^T dF(t_1, \ldots, t_{j-1}) \int_0^{T-t_1-\cdots-t_{j-1}} dG(t_j).$$
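
The step described as clear can be spelled out as follows. The shorthand $A = T - t_1 - \cdots - t_{j-1}$ is introduced here, the history arguments of H are suppressed, and G is taken to be the distribution function of a nonnegative sum of sojourn times, so that it is nondecreasing and vanishes below zero.

```latex
% Inner two integrals for a fixed history, with A = T - t_1 - ... - t_{j-1}.
\begin{align*}
\int_0^{A} e^{-\epsilon_j t_j} \, dH(t_j) \int_0^{A - t_j} dG(t_{j+1})
  &= \int_0^{A} e^{-\epsilon_j t_j} \, G(A - t_j) \, dH(t_j) \\
  &\le G(A) \int_0^{A} dH(t_j)
   \;\le\; G(A)
   \;=\; \int_0^{A} dG(t_j) .
\end{align*}
```

The first inequality uses $e^{-\epsilon_j t_j} \le 1$ and $G(A - t_j) \le G(A)$; the second uses the fact that H is a distribution function. Integrating both sides against $dF$ gives the stated bound.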


Consider the general case in the inductive step where the transition at state j
that is on the path is a component failure with rate $\alpha_j$. It competes with a
recovery, dH, and other component failures, rate $\beta_j$. The iterated integral is

$$\int_0^T dF(t_1, \ldots, t_{j-1}) \int_0^{T-t_1-\cdots-t_{j-1}} \alpha_j e^{-(\alpha_j+\beta_j)t_j} [1 - H[t_1, \ldots, t_{j-1}](t_j)] \, dt_j \int_0^{T-t_1-\cdots-t_j} dG(t_{j+1})$$


The theorem at the end of this section shows that the last iterated integral is
less than or equal to

$$\int_0^T dF(w_1, \ldots, w_{j-1}) \int_0^{T-w_1-\cdots-w_{j-1}} dG(t_{j+1}) \times \left\{ \frac{\int_0^T dF(t_1, \ldots, t_{j-1}) \int_0^\infty \alpha_j e^{-(\alpha_j+\beta_j)t_j} [1 - H[t_1, \ldots, t_{j-1}](t_j)] \, dt_j}{\int_0^T dF(w_1, \ldots, w_{j-1})} \right\}$$

The expression in the braces is less than or equal to $\alpha_j \mu_j$.

Hence the reliability model with the time dependent recovery has the same
upper bound as the semi-Markov reliability model.

Theorem With the notation as above 

Proof

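Let

$$v(x_1, \ldots, x_{j-1}) = \int_0^\infty \alpha_j e^{-(\alpha_j+\beta_j)t_j} [1 - H[x_1, \ldots, x_{j-1}](t_j)] \, dt_j$$

and note that $v(x_1, \ldots, x_{j-1}) \le 1$.
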
Consider the difference

The theorem is proved.